Special Topics in R

Module 5.2: Text as Data

Author: Alex Cardazzi

Affiliation: Old Dominion University

All materials can be found at alexcardazzi.github.io.

Text as Data

Oftentimes, there’s useful information contained in text columns. Or, perhaps numeric data is stored in a non-numeric format. The easiest example of this is date or time data.

In this portion of the module, we’ll go through ways of working with text and dates.

We are going to focus on two datasets as examples: the New York Knicks roster and police stops in Minneapolis.

lubridate

Before we get too into the weeds of text data, we should introduce a new package called lubridate.[2] This package facilitates working with dates, which are a weird mix of text and numbers.

Let’s go through some of the important functions / capabilities in lubridate.

Note: There is a great cheatsheet available online.

ymd(), dmy(), mdy(): when you read data into R and one of the columns has entries like "January 2 2014", you can use these functions to convert the text into dates. Note that y stands for year, d stands for day, and m stands for month, so just choose the function whose letter order matches your data.

Code
library("lubridate")

mdy("January 2 2014")
ymd("2015/08/10")
dmy("29-10-1978")
# February 30 does not exist, so this returns NA (with a warning)
ymd("2015-02-30")
Output
[1] "2014-01-02"
[1] "2015-08-10"
[1] "1978-10-29"
[1] NA

Output from these functions appears in year-month-day format, but is actually numeric under the hood: each date is stored as the number of days since January 1, 1970. We can see this by executing as.numeric(mdy("January 2 2014")), which returns 16072.
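To make the "numeric under the hood" point concrete, here is a minimal sketch. It assumes lubridate is installed, and the variable name d is made up for illustration.

```r
library(lubridate)

# `d` is an illustrative name, not something defined elsewhere in the module
d <- ymd("2014-01-02")

class(d)                        # a base R "Date" object
as.numeric(d)                   # days since 1970-01-01: 16072
as.numeric(ymd("1970-01-01"))   # the origin itself is day 0
d + 30                          # adding a number moves the date forward by days
```

Because dates are just day counts, ordinary arithmetic and comparisons work on them.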

Once your data is converted to well-behaved date objects, we can begin to manipulate them. We can extract the year, month, day, etc. using lubridate’s helpfully named year(), month(), and day() functions. We can also grab the weekday by using wday(df$date, label = TRUE).
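As a quick sketch of these extractor functions, we can apply them to a single date. The variable d is a made-up example, and the weekday number assumes lubridate's default setting where weeks start on Sunday.

```r
library(lubridate)

d <- ymd("2014-01-02")   # illustrative date (a Thursday)

year(d)    # 2014
month(d)   # 1
day(d)     # 2
wday(d)    # 5, counting from Sunday = 1 under the default settings
wday(d, label = TRUE)   # the same weekday as a labelled factor
```

The same calls work on a whole column, such as df$date, and return one value per row.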

Next, we can use floor_date(df$date, "quarter") to round our dates down to the start of the quarter, month, year, or whatever unit we choose.
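Here is a small sketch of floor_date() on one made-up date. Note that floor_date() always rounds down to the start of the period rather than to the closest one.

```r
library(lubridate)

d <- ymd("2015-08-10")   # illustrative date

floor_date(d, "month")     # "2015-08-01"
floor_date(d, "quarter")   # "2015-07-01"
floor_date(d, "year")      # "2015-01-01"
```

This is handy for grouping daily data into monthly or quarterly bins.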

Finally, we can calculate differences in dates by using simple subtraction, which returns the difference in days. However, oftentimes we want to know the number of, say, months between two dates. For this, we can use the following code: interval(ymd(date1), ymd(date2)) %/% months(1). Note that %/% is integer division, so this counts whole months.
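A minimal sketch of date differences follows, using two made-up dates. The key detail is the integer division operator %/%, which counts how many whole periods fit inside the interval.

```r
library(lubridate)

# two illustrative dates
date1 <- ymd("2014-01-02")
date2 <- ymd("2015-08-10")

date2 - date1                          # a difference in days
as.numeric(date2 - date1)              # 585
interval(date1, date2) %/% months(1)   # 19 whole months
interval(date1, date2) %/% years(1)    # 1 whole year
```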

We’ll go through some examples in a moment.

Text as Data

Date data is a very particular type of text data. However, working with text more generally is equally important when programming. We will not use any new packages for this, as base R has plenty of functions for us.

The first function we’ll take a look at is paste(), which has a sibling function paste0().

paste() takes whatever vectors you give it and smushes them together. paste() will recycle the shorter vector(s) until they match the length of the longest vector. paste() separates text with spaces by default, but this can be changed via the sep argument. paste0() defaults to sep = "", or nothing in between the text. Some examples are below:

Code
paste("alex", "cardazzi"); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alexander", "alex"), c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex"), "middle-name", c("cardazzi", "trebek")); cat("\n")
paste("alex", c("cardazzi", "trebek", "hamilton"), sep = "-"); cat("\n")
paste0("alex", c("cardazzi", "trebek", "hamilton")); cat("\n")
paste(c("alex", "christine"), c("cardazzi", "strong"), sep = " abcxyz ")
Output
[1] "alex cardazzi"

[1] "alex cardazzi" "alex trebek"   "alex hamilton"

[1] "alexander cardazzi" "alex trebek"        "alexander hamilton"

[1] "alex middle-name cardazzi" "alex middle-name trebek"  

[1] "alex-cardazzi" "alex-trebek"   "alex-hamilton"

[1] "alexcardazzi" "alextrebek"   "alexhamilton"

[1] "alex abcxyz cardazzi"    "christine abcxyz strong"
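One place paste() comes in handy is assembling date text from separate pieces. The sketch below uses hypothetical yr, mo, and dy vectors (not part of either dataset) and base R's as.Date() to parse the result.

```r
# hypothetical columns holding the pieces of a date
yr <- c(2014, 2015)
mo <- c(1, 8)
dy <- c(2, 10)

# glue the pieces together with dashes instead of spaces
date_txt <- paste(yr, mo, dy, sep = "-")
date_txt             # "2014-1-2"  "2015-8-10"

# the text is now in year-month-day order, so it can be parsed as a date
parsed <- as.Date(date_txt)
parsed               # "2014-01-02" "2015-08-10"
```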

Next, rather than combining character strings, we can break them up. There are a few ways to do this, but let’s work with substr() first. substr() accepts three arguments: x, start, and stop. Effectively, you tell R the “full” character vector and then the start and stop positions of the characters you want. Some examples are below:

Code
char <- c("julius randle", "jalen brunson")
substr(char, 1, 3)
# use `regexpr()` to give you the position of
#   a certain sub-string in another string.
substr(char, 1, regexpr(" ", char)); cat("\n")
# subtract 1 to stop just before the space
substr(char, 1, regexpr(" ", char) - 1); cat("\n")
# when the substring does not appear, regexpr() returns -1,
#   so substr() returns an empty string ("").
substr(char, 1, regexpr("b", char)); cat("\n")
# use nchar() to get the length of the string.
substr(char, regexpr(" ", char), nchar(char)); cat("\n")
# here, use +1 to avoid the space
substr(char, regexpr(" ", char) + 1, nchar(char))
Output
[1] "jul" "jal"
[1] "julius " "jalen " 

[1] "julius" "jalen" 

[1] ""        "jalen b"

[1] " randle"  " brunson"

[1] "randle"  "brunson"
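Since regexpr() returns -1 when there is no match, it can be worth guarding against that case. The sketch below is one way to do so, assuming we want to keep single-word names whole; the third name is made up to trigger the no-match case.

```r
# "nene" has no space, so regexpr() will return -1 for it
char <- c("julius randle", "jalen brunson", "nene")

# as.integer() drops the extra attributes regexpr() attaches
pos <- as.integer(regexpr(" ", char))
pos   # 7 6 -1

# if there is no space, keep the full string as the first name
first <- ifelse(pos == -1, char, substr(char, 1, pos - 1))
# if there is no space, there is no last name to report
last <- ifelse(pos == -1, NA_character_, substr(char, pos + 1, nchar(char)))

first   # "julius" "jalen" "nene"
last    # "randle" "brunson" NA
```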

To check if a certain substring exists in another string, we can use grepl(). This accepts two arguments: a single substring and a vector of other strings. Examples follow:

Code
str <- c("mitch robinson", "rj barrett", "patrick ewing")
grepl("b", str); cat("\n")
# You can also search for multiple things, like "b or r", by:
grepl("b|r", str)
Output
[1]  TRUE  TRUE FALSE

[1] TRUE TRUE TRUE
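Because grepl() returns TRUE/FALSE values, its output can be used to subset or count. Here is a short sketch; the names players and has_b are ours, chosen for illustration.

```r
players <- c("mitch robinson", "rj barrett", "patrick ewing")

has_b <- grepl("b", players)

players[has_b]   # keep only the matches: "mitch robinson" "rj barrett"
sum(has_b)       # number of matches: 2
mean(has_b)      # share of matches: 2 out of 3
```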

Similarly, we can use gsub() to find and replace things.

Code
str <- c("mitch robinson", "rj barrett", "patrick ewing")
gsub("b", "z", str); cat("\n")
# to delete things, replace them with ""
gsub("b|r", "", str)
Output
[1] "mitch rozinson" "rj zarrett"     "patrick ewing" 

[1] "mitch oinson" "j aett"       "patick ewing"

Regular Expressions

Working with text patterns can get a bit crazy. There’s a whole mini-language for this, called regular expressions, that we won’t fully dive into, but I want to expose you to some of it. Here are some important patterns (note that backslashes must be doubled inside R strings):

  • \\d: this is all digits 0-9. This is very helpful if you want to quickly find and replace all numeric values.
  • \\D: this is the opposite – all non-digit characters.
  • [[:alpha:]]: all alphabetic characters.
  • \\s: space, tab, new line, etc. This is helpful when there’s a bunch of spaces you want to delete.
  • [[:punct:]]: all punctuation characters.
  • .: this means any character.
  • +: means match the previous character at least once. This is helpful when there are many, for example, spaces in a row. You can use gsub("\\s+", "", txt).
  • ^: indicates the beginning of a string.
  • $: indicates the end of a string.

Below are some examples of how to use regular expressions:

Code
namez <- c("margot elise robbie",
           "samuel l jackson",
           "jennifer lawrence")
# find space-something-space and replace with nothing.
gsub("\\s.+\\s", "", namez); cat("\n")
# find letters-space and replace with nothing
gsub("[[:alpha:]]+\\s", "", namez); cat("\n")
# find start_of_string-letters-space and replace with nothing
gsub("^[[:alpha:]]+\\s", "", namez); cat("\n")
# find space-letters-end_of_string and replace with nothing
gsub("\\s[[:alpha:]]+$", "", namez)
Output
[1] "margotrobbie"      "samueljackson"     "jennifer lawrence"

[1] "robbie"   "jackson"  "lawrence"

[1] "elise robbie" "l jackson"    "lawrence"    

[1] "margot elise" "samuel l"     "jennifer"    
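For a few more of the patterns listed earlier, here is a sketch on a single made-up string. The text in txt is invented purely for illustration.

```r
# an invented string with digits, punctuation, and a double space
txt <- "Call  555-0199, ext. 42!"

masked   <- gsub("\\d", "#", txt)           # replace every digit
digits   <- gsub("\\D", "", txt)            # keep only the digits
no_punct <- gsub("[[:punct:]]", "", txt)    # drop punctuation
squeezed <- gsub("\\s+", " ", txt)          # collapse runs of whitespace
no_bang  <- gsub("!$", "", txt)             # drop a "!" only at the end

masked     # "Call  ###-####, ext. ##!"
digits     # "555019942"
no_punct   # "Call  5550199 ext 42"
squeezed   # "Call 555-0199, ext. 42!"
no_bang    # "Call  555-0199, ext. 42"
```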

Regular expressions are very difficult, but very convenient once you get the hang of them. Just like lubridate, there’s a fantastic cheatsheet online you should check out.

Knicks Roster

Let’s begin by practicing on the Knicks roster:

Code
knicks <- read.csv("https://alexcardazzi.github.io/econ311/data/knicks23.csv")
knicks
Output
    X  No.             Player Pos   Ht  Wt        Birth.Date Var.7 Exp             College                   bbrefID
1   1   51   Ryan Arcidiacono  PG  6-3 195    March 26, 1994    us   5           Villanova /players/a/arcidry01.html
2   2    9         RJ Barrett  SG  6-6 214     June 14, 2000    ca   3                Duke /players/b/barrerj01.html
3   3   11      Jalen Brunson  PG  6-2 190   August 31, 1996    us   4           Villanova /players/b/brunsja01.html
4   4   13      Evan Fournier  SG  6-7 205  October 29, 1992    fr  10                     /players/f/fournev01.html
5   5    6     Quentin Grimes  SG  6-5 205       May 8, 2000    us   1     Kansas, Houston /players/g/grimequ01.html
6   6    3          Josh Hart  SF  6-5 215     March 6, 1995    us   5           Villanova  /players/h/hartjo01.html
7   7   55 Isaiah Hartenstein   C  7-0 250       May 5, 1998    us   4                     /players/h/harteis01.html
8   8    8    DaQuan Jeffries  SG  6-5 230   August 30, 1997    us   3 Oral Roberts, Tulsa /players/j/jeffrda01.html
9   9 0, 3       Trevor Keels  SG  6-5 221   August 26, 2003    us   R                Duke /players/k/keelstr01.html
10 10    2      Miles McBride  PG  6-2 200 September 8, 2000    us   1       West Virginia /players/m/mcbrimi01.html
11 11   17     Svi Mykhailiuk  SF  6-7 205     June 10, 1997    ua   4              Kansas /players/m/mykhasv01.html
12 12    5  Immanuel Quickley  SG  6-3 190     June 17, 1999    us   2            Kentucky /players/q/quickim01.html
13 13   30      Julius Randle  PF  6-8 250 November 29, 1994    us   8            Kentucky /players/r/randlju01.html
14 14    0        Cam Reddish  SF  6-8 218 September 1, 1999    us   3                Duke /players/r/reddica01.html
15 15   23  Mitchell Robinson   C  7-0 240     April 1, 1998    us   4    Western Kentucky /players/r/robinmi01.html
16 16    4       Derrick Rose  PG  6-3 200   October 4, 1988    us  13             Memphis  /players/r/rosede01.html
17 17   45       Jericho Sims   C 6-10 245  October 20, 1998    us   1               Texas  /players/s/simsje01.html
18 18    1         Obi Toppin  PF  6-9 220     March 4, 1998    us   2              Dayton /players/t/toppiob01.html

Let’s calculate the following items:

  • The fraction of players who are either a point guard (PG) or shooting guard (SG).
  • Each player’s experience in the league in years.
  • Each player’s height in inches.
  • Each player’s exact age in days on the day of the first game of the 2022-23 season (Oct. 19, 2022).
  • The player who is closest in age to Jalen Brunson, the indisputable best player on the team.

First, let’s examine which players are guards. To do this, we are going to use grepl() to search for "G" in the Pos column. If we find it, we are going to give this player a 1. If we don’t, we are going to give them a 0.

Code
knicks$guard <- ifelse(grepl("G", knicks$Pos), 1, 0)
head(knicks[,c("Player", "Pos", "guard")])
cat("Percent of the roster that are guards:",
    round(mean(knicks$guard), 2)*100, "%")
Output
            Player Pos guard
1 Ryan Arcidiacono  PG     1
2       RJ Barrett  SG     1
3    Jalen Brunson  PG     1
4    Evan Fournier  SG     1
5   Quentin Grimes  SG     1
6        Josh Hart  SF     0
Percent of the roster that are guards: 56 %
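As a side note, searching for "G" works here because only guard positions contain that letter. A stricter alternative is an exact match with %in%. The sketch below uses a small stand-in vector (the first six positions from the roster) rather than the full data.

```r
# stand-in for knicks$Pos: the first six positions shown above
pos <- c("PG", "SG", "PG", "SG", "SG", "SF")

# pattern match: anything containing a "G"
share_grepl <- mean(grepl("G", pos))

# exact match: only the two guard labels
share_in <- mean(pos %in% c("PG", "SG"))

share_grepl   # 5 of 6 are guards
share_in      # same answer for this data
```

Taking the mean of a TRUE/FALSE vector gives the share of TRUE values, so the ifelse() step is optional.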

Next, let’s calculate the team’s average experience in the league. However, as you might have noticed, rookies (meaning players who have never played in the league before) have an “R” for their experience. Converting this column to numeric turns each “R” into NA, so the average of the vector will be NA as well. If we simply drop the NA values, this will overstate the average experience since these rookies should have values equal to zero. Instead, let’s replace “R” with 0.

Code
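# "R" cannot be converted to a number, so it becomes NA (with a warning)
mean(as.numeric(knicks$Exp))
# dropping the NA ignores the rookie and overstates the average
mean(as.numeric(knicks$Exp), na.rm = TRUE)
# rookies have zero years of experience
knicks$Exp <- gsub("R", "0", knicks$Exp)
mean(as.numeric(knicks$Exp))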
Output
[1] NA
[1] 4.294118
[1] 4.055556
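To see the three cases side by side on a smaller scale, here is a sketch with a made-up experience vector. The names exp_raw, exp_num, and exp_fixed are illustrative only.

```r
# made-up experience values, with one rookie
exp_raw <- c("5", "3", "R", "1")

# "R" becomes NA; suppressWarnings() hides the coercion warning
exp_num <- suppressWarnings(as.numeric(exp_raw))

mean(exp_num)                 # NA, because of the rookie
mean(exp_num, na.rm = TRUE)   # 3, which ignores the rookie entirely

# treat the rookie as zero years of experience instead
exp_fixed <- as.numeric(ifelse(exp_raw == "R", "0", exp_raw))
mean(exp_fixed)               # 2.25
```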

The next bullet wants us to calculate player heights. To do this, we are going to grab the first part of their height (feet), multiply by 12 to get inches, and then add in the second part of their height (inches).

Code
knicks$foot <- as.numeric(gsub("-.+", "", knicks$Ht))
knicks$inch <- as.numeric(gsub(".-", "", knicks$Ht))
knicks$Ht_inch <- (knicks$foot*12) + knicks$inch
head(knicks[,c("Player", "Ht", "foot", "inch", "Ht_inch")])
Output
            Player  Ht foot inch Ht_inch
1 Ryan Arcidiacono 6-3    6    3      75
2       RJ Barrett 6-6    6    6      78
3    Jalen Brunson 6-2    6    2      74
4    Evan Fournier 6-7    6    7      79
5   Quentin Grimes 6-5    6    5      77
6        Josh Hart 6-5    6    5      77
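The first six rows all have single-digit inches, so it is worth checking a height like 6-10 as well. The sketch below uses a made-up ht vector and slightly stricter patterns, anchored with ^ and $, as one possible alternative.

```r
# illustrative heights, including a two-digit inch value
ht <- c("6-3", "7-0", "6-10")

# drop the dash and the digits at the end to keep the feet
feet <- as.numeric(gsub("-\\d+$", "", ht))
# drop the digits and the dash at the start to keep the inches
inches <- as.numeric(gsub("^\\d+-", "", ht))

ht_inch <- feet * 12 + inches
ht_inch   # 75 84 82
```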

Now we are going to calculate a player’s age in days at the start of the season. To do this, we have to convert their birthday text into a date object. Then, we are going to subtract the birthdays vector from the date of the first game. This will return the difference in days. We can also take the difference in months (or years, quarters, etc.), which is demonstrated below.

Code
library("lubridate")
knicks$bday <- mdy(knicks$Birth.Date)
knicks$age <- as.numeric(ymd("2022-10-19") - knicks$bday)
knicks$age_m <- interval(knicks$bday, ymd("2022-10-19")) %/% months(1)
head(knicks[,c("Player", "Birth.Date", "bday", "age", "age_m")])
Output
            Player       Birth.Date       bday   age age_m
1 Ryan Arcidiacono   March 26, 1994 1994-03-26 10434   342
2       RJ Barrett    June 14, 2000 2000-06-14  8162   268
3    Jalen Brunson  August 31, 1996 1996-08-31  9545   313
4    Evan Fournier October 29, 1992 1992-10-29 10947   359
5   Quentin Grimes      May 8, 2000 2000-05-08  8199   269
6        Josh Hart    March 6, 1995 1995-03-06 10089   331
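The same interval approach gives age in whole years. Here is a small sketch using two birthdays from the roster (Jalen Brunson and RJ Barrett), typed in year-month-day form; the names bday and opener are ours.

```r
library(lubridate)

# birthdays for Jalen Brunson and RJ Barrett
bday <- ymd(c("1996-08-31", "2000-06-14"))
# date of the first game of the season
opener <- ymd("2022-10-19")

age_days   <- as.numeric(opener - bday)
age_months <- interval(bday, opener) %/% months(1)
age_years  <- interval(bday, opener) %/% years(1)

age_days     # 9545 8162
age_months   # 313 268
age_years    # 26 22
```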

Finally, we are going to calculate each player’s difference in age to Jalen Brunson. First, we are going to calculate JB’s age, and then subtract his age from everyone else’s age. To finish, we are going to sort the data by this difference.

Code
knicks2 <- knicks
jb_age <- knicks2$age[knicks2$Player == "Jalen Brunson"]
knicks2$jb_age_diff <- abs(knicks2$age - jb_age)
knicks2 <- knicks2[order(knicks2$jb_age_diff),]
head(knicks2[,c("Player", "jb_age_diff")])
Output
              Player jb_age_diff
3      Jalen Brunson           0
11    Svi Mykhailiuk         283
8    DaQuan Jeffries         364
6          Josh Hart         544
18        Obi Toppin         550
15 Mitchell Robinson         578

Police Stops

Next, we are going to explore some crime data. The Minneapolis police record information on stops they conduct such as location and date/time. Let’s generate some numeric data from both of these columns.

First, let’s read in the data and view it.

Code
pd <- read.csv("https://alexcardazzi.github.io/econ311/data/minn_stops.csv")
print(head(pd, 10))
Output
          id                 date    problem    race  gender                 location precinct
1  17-036337 2017-02-01T00:00:12Z    traffic   White    Male 44.95134737 -93.28133076        5
2  17-036349 2017-02-01T00:07:38Z suspicious   Black  Female  44.9474742 -93.29829195        5
3  17-036351 2017-02-01T00:11:39Z    traffic   Other    Male       44.89233 -93.28067        5
4  17-036357 2017-02-01T00:18:50Z suspicious  Latino    Male       45.01497 -93.24734        2
5  17-036360 2017-02-01T00:22:41Z    traffic   Black    Male 45.00951934 -93.28989378        4
6  17-036378 2017-02-01T00:39:23Z suspicious Unknown Unknown 44.94047824 -93.26763786        3
7  17-036380 2017-02-01T00:39:53Z suspicious   Black    Male  44.9784072 -93.27881908        1
8  17-036381 2017-02-01T00:40:11Z    traffic   White  Female 44.97429462 -93.27996035        1
9  17-036388 2017-02-01T00:46:54Z    traffic   Black  Female        44.94659 -93.2797        5
10 17-036390 2017-02-01T00:48:37Z    traffic   White    Male 44.97526008 -93.26992525        1

Next, let’s convert the datetime text into something usable with lubridate. Since these values contain time information, we need to use the still-conveniently named ymd_hms() function. Below, we’ll generate some distribution plots of the weekday, hour, and minute of these police stops.

Code
pd$timestamp <- ymd_hms(pd$date)

par(mfrow = c(1, 3))
plot(table(wday(pd$timestamp, label = TRUE)),
     xlab = "Weekday", ylab = "Frequency")
plot(table(hour(pd$timestamp)),
     xlab = "Hour", ylab = "")
plot(table(minute(pd$timestamp)),
     xlab = "Minute", ylab = "")
par(mfrow = c(1, 1))
Plot

Distribution plots of weekdays, hours, and minutes.
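To see what ymd_hms() and the extractor functions return, here is a sketch on a single timestamp copied from the second row of the data. The variable name stamp is made up, and the weekday number assumes lubridate's default of weeks starting on Sunday.

```r
library(lubridate)

# one timestamp from the police stops data
stamp <- ymd_hms("2017-02-01T00:07:38Z")

hour(stamp)     # 0
minute(stamp)   # 7
second(stamp)   # 38
wday(stamp)     # 4, a Wednesday when Sunday = 1
```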

Finally, we can extract latitude and longitude from the location column, and plot the resulting points as a map. Each entry stores the latitude first, then a space, then the longitude.

Code
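# longitude is everything after the space; latitude is everything before it
pd$lon <- as.numeric(gsub(".+\\s", "", pd$location))
pd$lat <- as.numeric(gsub("\\s.+", "", pd$location))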
plot(pd$lon, pd$lat, pch = 19, las = 1,
     col = scales::alpha("black", 0.1),
     xlab = "", ylab = "")
Plot

Plotting geolocation of police stops in Minneapolis.
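Here is a sketch of the same split on two location strings copied from the data. Keep in mind that gsub() returns text, so wrapping the result in as.numeric() gives true numbers that can be checked or used in calculations.

```r
# two location strings from the police stops data: "latitude longitude"
loc <- c("44.95134737 -93.28133076", "44.89233 -93.28067")

# delete the space and everything after it to keep the latitude
lat <- as.numeric(gsub("\\s.+", "", loc))
# delete everything up to and including the space to keep the longitude
lon <- as.numeric(gsub(".+\\s", "", loc))

lat   # 44.95135 44.89233 (printed with default rounding)
lon   # -93.28133 -93.28067

# a quick sanity check that the values are valid coordinates
all(lat >= -90 & lat <= 90)     # TRUE
all(lon >= -180 & lon <= 180)   # TRUE
```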

There’s a whole class of data that we have not talked about yet that has to do with spatial features. Here, I am going to plot Minneapolis neighborhoods with the stops data on top. There are tons of things you can do with this, but it is outside the scope of this course.

Code
minn <- sf::read_sf("https://raw.githubusercontent.com/blackmad/neighborhoods/master/minneapolis.geojson")
par(mar = c(0, 0, 0, 0))
plot(minn$geometry, col = scales::alpha("dodgerblue", 0.3))
points(pd$lon, pd$lat, pch = 20,
       col = scales::alpha("black", 0.2))
Plot

Plotting police stops on top of Minneapolis neighborhood boundaries.

Footnotes

  1. Note: I have modified this file for the purposes of this class.

  2. Yeah, the name is hilarious. Mount Rushmore of package names.